ITEC 621 Project

Predicting Median Housing Prices in New York City

Authors
Affiliation

Ledia Dobi

American University

Conie O’Malley

American University

Published

March 5, 2025

1 Library and Package installation

In this section we assigned a vector variable of the packages we used for this project. There is a function to check for the packages and install and library them if they are not already installed and libraried.

Code
# library chunk
required_packages <- c("dplyr", "readr", "tidyverse", "lmtest", "lubridate", "glmnet", "pls", "readxl", "GGally", "boot", "scales", "ggthemes", "caret")

# Function to check if a package is installed
install_and_load <- function(package) {
  if (!require(package, character.only = TRUE)) {
    utils::install.packages(package, dependencies = TRUE)
    library(package, character.only = TRUE)
  }
}

for (pkg in required_packages) {
  install_and_load(pkg)
}

2 Data Preparation

2.1 Data Gathering

We gathered our data from the following sources:

  • Zillow
  • Median Home Sale Price - New Construction Sales
  • Mean Home Value - Rental Cost Index
  • Federal Reserve
    • Federal Reserve Interest Rates
    • 15 Year Mortgage Rates
    • 30 Year Mortgage Rates
    • NY Median Household Income
    • National Median Household Income
    • Unemployment Data
  • New York Police Department
    • Major Felony Offenses
    • Non-Major Felony Offenses
    • Misdemenaor Offenses

Gathering data from a variety of sources allowed us to create a comprehensive analysis of the housing market trends in New York City. Combining additional non-financial data helps to us understand the factors that influence housing prices and crime rates. This analysis can help policymakers, investors, and residents make informed decisions about their financial and housing situations.

Code
# import data sets
median_sale_price <- readr::read_csv("data/median_home_price.csv")
Rows: 21488 Columns: 309
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (6): RegionName, RegionType, StateName, State, Metro, CountyName
dbl (303): RegionID, SizeRank, 2000-01-31, 2000-02-29, 2000-03-31, 2000-04-3...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
new_construction_sales <- readr::read_csv("data/Metro_new_con_sales_count_raw_uc_sfr_month.csv")
Rows: 283 Columns: 88
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): RegionName, RegionType, StateName
dbl (85): RegionID, SizeRank, 2018-01-31, 2018-02-28, 2018-03-31, 2018-04-30...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
mean_sfr_value <- readr::read_csv("data/home_value.csv")
Rows: 895 Columns: 306
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): RegionName, RegionType, StateName
dbl (303): RegionID, SizeRank, 2000-01-31, 2000-02-29, 2000-03-31, 2000-04-3...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
interest_rates <- readr::read_csv("data/fed_interest_rates.csv")
Rows: 301 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (1): FEDFUNDS
date (1): observation_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
mortgage_rate_15_year <- readr::read_csv("data/mortgage_rates_15_year.csv")
Rows: 1312 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (1): MORTGAGE15US
date (1): observation_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
mortgage_rate_30_year <- readr::read_csv("data/mortgage_rates_30_year.csv")
Rows: 1312 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (1): MORTGAGE30US
date (1): observation_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
ny_median_household_income <- readr::read_csv("data/new_york_median_household_income.csv")
Rows: 24 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (1): MEHOINUSNYA672N
date (1): observation_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
national_median_household_income <- readr::read_csv("data/national_median_household_income.csv")
Rows: 24 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (1): MEHOINUSA646N
date (1): observation_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
rental_costs <- readr::read_csv("data/rental_index.csv")
Rows: 517 Columns: 126
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): RegionName, RegionType, StateName
dbl (123): RegionID, SizeRank, 2015-01-31, 2015-02-28, 2015-03-31, 2015-04-3...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
misdemeanor_offenses <- readr::read_csv("data/misdemeanor-offenses-2000-2024.csv")
Rows: 18 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): OFFENSE
dbl (25): 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
non_seven_major_felonies <- readr::read_csv("data/non-seven-major-felony-offenses-2000-2024.csv")
Rows: 9 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): OFFENSE
dbl (25): 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
major_felonies <- readr::read_csv("data/seven-major-felony-offenses-2000-2024.csv")
Rows: 8 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): OFFENSE
num (25): 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
unemployment_data <- readr::read_csv("data/unemployment_data.csv")
Rows: 300 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (1): NYUR
date (1): observation_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.2 Data Cleaning

Our data required extensive cleaning and formatting, despite the records being very clean.

  • Dates - We had to manipulate the date columns for all data sets to assure that all date formats and dates we the same to facilitate data merging. Some data sets were set for the last day of the month and others to the first day of the month. We decided to use the first day of the month which required us to use the lubridate package to adjust dates within the same months. There were other data that were yearly, median household income and crime data, which we computed to monthly values and distributed those throughout the months of the year. While this does not create a perfect representation of the data since there isn’t a way to capture a trend, it helps us in an overall time series analysis as opposed to discarding it based on its periodicity.

  • Dimensions - Many of the data sets were wide data sets that we had to pivot to long data sets to assign variables to each column and have the Date as the joining column, once properly formatted.

  • Missing Values - We removed any NA values from our data set. This lead us to removing variables New Construction Sales and Rental Cost Index because there wasn’t enough data points to thoroughly model. Removing these two variables allowed us to have over 300 observations per variable as opposed to ~ 70 by retaining them.

  • Economic Crises - We included dummy variables for the 2008 financial crisis and the COVID-19 pandemic. We imputed a 0 for months that were not included in these crises and 1 if they were in these crises. We hope that this will capture some of the outside impacts on housing prices that would not be otherwise captured without the inclusion of dummy variables.

    • 2008 Financial Crisis - 2007-12-01 - 2009-06-01
    • COVID-19 Pandemic - 2020-03-01 - 2023-05-01
Code
# remove columns and pivot longer
ny_median_sale_price <- median_sale_price %>% 
  dplyr::filter(RegionName == "New York") %>% 
  dplyr::select(-RegionID, -SizeRank, -RegionType, -StateName, -CountyName, -Metro, -State) %>% 
  tidyr::pivot_longer(cols = -RegionName,
    names_to = "Date",
    values_to = "median_sale_price") %>% 
  dplyr::mutate(
    Date = as.Date(Date, "%Y-%m-%d"), 
    Date = lubridate::floor_date(Date, "month"))

ny_new_construction_sales <- new_construction_sales %>% 
  dplyr::filter(RegionName == "New York, NY") %>% 
  dplyr::select(-RegionID, -SizeRank, -RegionType, -StateName) %>% 
  tidyr::pivot_longer(cols = -RegionName,
    names_to = "Date",
    values_to = "new_construction_sales") %>% 
  dplyr::select(-RegionName) %>% 
  dplyr::mutate(
    Date = as.Date(Date, "%Y-%m-%d"), 
    Date = lubridate::floor_date(Date, "month"))

ny_mean_sfr_value <- mean_sfr_value %>% 
  dplyr::filter(RegionName == "New York, NY") %>% 
  dplyr::select(-RegionID, -SizeRank, -RegionType, -StateName) %>% 
  tidyr::pivot_longer(cols = -RegionName,
    names_to = "Date",
    values_to = "mean_sfr_value") %>% 
  dplyr::select(-RegionName) %>% 
  dplyr::mutate(
    Date = as.Date(Date, "%Y-%m-%d"), 
    Date = lubridate::floor_date(Date, "month"))

rental_costs <- rental_costs %>% 
  dplyr::filter(RegionName == "New York, NY") %>% 
  dplyr::select(-RegionID, -SizeRank, -RegionType, -StateName) %>% 
  tidyr::pivot_longer(cols = -RegionName,
    names_to = "Date",
    values_to = "mean_rental_price") %>% 
  dplyr::mutate(
    Date = as.Date(Date, "%Y-%m-%d"), 
    Date = lubridate::floor_date(Date, "month"))
Code
# rename date and variable columns
interest_rates <- interest_rates %>% 
  dplyr::rename(Date = observation_date, Fed_Interest_Rate = FEDFUNDS) %>% 
  dplyr::mutate(Date = as.Date(Date, format = "%Y-%m-%d"))

# adjust date column
mortgage_rate_15_year <- mortgage_rate_15_year %>% 
  dplyr::rename(Date = observation_date, mortgage_rate_15_year = MORTGAGE15US) %>% 
  dplyr::mutate(Date = as.Date(Date, format = "%Y-%m-%d"))

# calculate monthly average interest rates - weekly data was provided
# converted to monthly average for analysis
monthly_avg_15_year <- mortgage_rate_15_year %>% 
  dplyr::mutate(YearMonth = lubridate::floor_date(Date, "month")) %>% 
  dplyr::group_by(YearMonth) %>% 
  dplyr::summarise(monthly_avg_15_year = mean(mortgage_rate_15_year, na.rm = TRUE)) %>% 
  dplyr::mutate(Date = as.Date(format(YearMonth, "%Y-%m-01"), format = "%Y-%m-%d")) %>% 
  dplyr::select(Date, monthly_avg_15_year)

# date debugging - there was a problem in merging data sets, more date formatting was 
# required to fix issues
mortgage_rate_15_year <- mortgage_rate_15_year %>%
  tidyr::complete(Date = seq(min(interest_rates$Date), 
                            max(interest_rates$Date), by = "month")) %>%
  tidyr::fill(everything(), .direction = "down")

# adjust date column
mortgage_rate_30_year <- mortgage_rate_30_year %>% 
  dplyr::rename(Date = observation_date, mortgage_rate_30_year = MORTGAGE30US) %>% 
  dplyr::mutate(Date = as.Date(Date, format = "%Y-%m-%d"))

# calculate monthly average interest rates - weekly data was provided
# converted to monthly average for analysis
monthly_avg_30_year <- mortgage_rate_30_year %>% 
  dplyr::mutate(YearMonth = lubridate::floor_date(Date, "month")) %>% 
  dplyr::group_by(YearMonth) %>% 
  dplyr::summarise(monthly_avg_30_year = mean(mortgage_rate_30_year, na.rm = TRUE)) %>% 
  dplyr::mutate(Date = as.Date(format(YearMonth, "%Y-%m-01"), format = "%Y-%m-%d")) %>% 
  dplyr::select(Date, monthly_avg_30_year)

# date debugging - there was a problem in merging data sets, more date formatting was 
# required to fix issues
mortgage_rate_30_year <- mortgage_rate_30_year %>%
  tidyr::complete(Date = seq(min(interest_rates$Date), 
                            max(interest_rates$Date), by = "month")) %>%
  tidyr::fill(everything(), .direction = "down")

# adjust date column
ny_median_household_income <- ny_median_household_income %>% 
  dplyr::rename(Date = observation_date, ny_median_hh_income = MEHOINUSNYA672N) %>% 
  dplyr::mutate(Date = as.Date(Date, format = "%Y-%m-%d"))

# distribute yearly data down to individual months for analysis purposes
# assign yearly values and grouping
ny_median_household_income_yearly <- ny_median_household_income %>%
  dplyr::mutate(Year = lubridate::year(Date)) %>% 
  dplyr::select(Year, ny_median_hh_income) %>% 
  dplyr::group_by(Year) %>% 
  dplyr::summarise(ny_median_hh_income = dplyr::first(ny_median_hh_income), 
                  .groups = 'drop') 

# adjust to monthly and join financial data back                   
ny_median_household_income_monthly <- tidyr::expand_grid(
  Year = ny_median_household_income_yearly$Year,
  Month = 1:12) %>% 
  dplyr::left_join(ny_median_household_income_yearly, by = "Year") %>% 
  dplyr::mutate(
    Date = as.Date(sprintf("%d-%02d-01", Year, Month)),
    ny_median_hh_income = ny_median_hh_income/12) %>% 
  dplyr::select(Date, ny_median_hh_income) %>% 
  dplyr::arrange(Date)

# adjust date column
national_median_household_income <- national_median_household_income %>% 
  dplyr::rename(Date = observation_date, national_median_hh_income = MEHOINUSA646N) %>% 
  dplyr::mutate(Date = as.Date(Date, format = "%Y-%m-%d"))

# distribute yearly data down to individual months for analysis purposes
# assign yearly values and grouping
national_median_household_income_yearly <- national_median_household_income %>%
  dplyr::mutate(Year = lubridate::year(Date)) %>% 
  dplyr::select(Year, national_median_hh_income) %>% 
  dplyr::group_by(Year) %>% 
  dplyr::summarise(national_median_hh_income = dplyr::first(national_median_hh_income), 
                  .groups = 'drop') 

# adjust to monthly and join financial data back               
national_median_household_income_monthly <- tidyr::expand_grid(
  Year = national_median_household_income_yearly$Year,
  Month = 1:12) %>% 
  dplyr::left_join(national_median_household_income_yearly, by = "Year") %>% 
  dplyr::mutate(
    Date = as.Date(sprintf("%d-%02d-01", Year, Month)),
    national_median_hh_income = national_median_hh_income/12) %>% 
  dplyr::select(Date, national_median_hh_income) %>% 
  dplyr::arrange(Date)
Code
# pull totals from each data set
total_misdemeanors <- utils::tail(misdemeanor_offenses, 1)
total_non_seven_felonies <- utils::tail(non_seven_major_felonies, 1)
total_major_felonies <- utils::tail(major_felonies, 1)

# add category identifiers
total_misdemeanors <- total_misdemeanors %>%
  dplyr::mutate(category = "misdemeanor_offenses") %>% 
  tidyr::pivot_longer(
    cols = dplyr::starts_with("20"),
    names_to = "date",
    values_to = "misdemeanor_offenses"
  ) %>% 
  dplyr::select(-OFFENSE, -category)

total_non_seven_felonies <- total_non_seven_felonies %>%
  dplyr::mutate(category = "non_seven_major_felonies") %>%
  tidyr::pivot_longer(
    cols = dplyr::starts_with("20"),
    names_to = "date",
    values_to = "non_seven_major_felonies"
  ) %>% 
  dplyr::select(-OFFENSE, -category)

total_major_felonies <- total_major_felonies %>%
  dplyr::mutate(category = "major_felonies") %>% 
  tidyr::pivot_longer(
    cols = dplyr::starts_with("20"),
    names_to = "date",
    values_to = "major_felonies"
  ) %>% 
  dplyr::select(-OFFENSE, -category)

# join all crime data
total_crime_commissions <- total_misdemeanors %>% 
  dplyr::left_join(total_non_seven_felonies, by = "date") %>% 
  dplyr::left_join(total_major_felonies, by = "date") %>% 
  dplyr::mutate(Date = as.Date(paste0(substr(date, 1, 4), "-01-01")))


total_crime_commissions_yearly <- total_crime_commissions %>%
  dplyr::select(-date) %>% 
  dplyr::mutate(Year = lubridate::year(Date)) %>% 
  dplyr::select(Year, non_seven_major_felonies, major_felonies, misdemeanor_offenses) %>% 
  dplyr::group_by(Year)

# adjust to monthly and join financial data back                   
total_crime_commissions_monthly <- tidyr::expand_grid(
  Year = total_crime_commissions_yearly$Year,
  Month = 1:12) %>% 
  dplyr::left_join(total_crime_commissions_yearly, by = "Year") %>% 
  dplyr::mutate(
    Date = as.Date(sprintf("%d-%02d-01", Year, Month)),
    non_seven_major_felonies = non_seven_major_felonies/12,
    major_felonies = major_felonies/12,
    misdemeanor_offenses = misdemeanor_offenses/12) %>% 
  dplyr::select(Date, non_seven_major_felonies, major_felonies, misdemeanor_offenses) %>% 
  dplyr::arrange(Date)

# manipulate unemployement data
unemployment_data <- unemployment_data %>% 
  dplyr::mutate(Date = as.Date(observation_date)) %>% 
  dplyr::rename(unemployment_rate = NYUR) %>% 
  dplyr::select(-observation_date) %>% 
  dplyr::arrange(Date)

2.3 Dataset Joins

Once our data manipulation was complete, we joined the data set to give us a final count of 14 variables and 300 observations.

  • Date
  • mean_sfr_value - estimated mean value of homes
  • median_sale_price - median sale price of homes in a specified month
  • Fed_Interest_Rate - Federal Reserve Interest Rate
    • calculated by monthly average - \((week_1 + ... + week_n) / n\) for each month
  • mortgage_rate_15_year - average 15 year mortgage rate in a specified month
  • mortgage_rate_30_year - average 30 year mortgage rate in a specified month
  • ny_median_hh_income - median household income for residents of New York State
    • calculated as n/12
  • national_median_hh_income - median household income for residents of the United States
    • calculated as n/12
  • non_seven_major_felonies - non violent felony commissions in NYC
    • calculated as n/12
  • major_felonies - violent felony commissions in NYC
    • calculated as n/12
  • misdemeanor_offenses - misdemeanor commissions in NYC
    • calculated as n/12
  • unemployment_rate - national unemployment rate in a specified month
  • housing_crisis - 2008 financial crisis dummy variable
    • 0 = non-crisis, 1 = crisis
  • covid_pandemic - COVID-19 pandemic dummy variable
    • 0 = non-pandemic, 1 = pandemic
Code
# join data sets
ny_housing_data <- ny_mean_sfr_value %>% 
  dplyr::left_join(ny_median_sale_price, by = "Date") %>% 
  dplyr::left_join(interest_rates, by = "Date") %>% 
  dplyr::left_join(mortgage_rate_15_year, by = "Date") %>% 
  dplyr::left_join(mortgage_rate_30_year, by = "Date") %>% 
  dplyr::left_join(ny_median_household_income_monthly, by = "Date") %>% 
  dplyr::left_join(national_median_household_income_monthly, by = "Date") %>% 
  dplyr::left_join(total_crime_commissions_monthly, by = "Date") %>% 
  dplyr::left_join(unemployment_data, by = "Date") %>% 
  dplyr::mutate(housing_crisis = ifelse(Date >= as.Date("2007-12-01") &
                                        Date <= as.Date("2009-06-01"), 1, 0)) %>% 
  dplyr::mutate(covid_pandemic = ifelse(Date >= as.Date("2020-03-01") &
                                        Date <= as.Date("2023-05-01"), 1, 0)) %>% 
  dplyr::select(-RegionName)

# Convert Date from character to Date class
ny_housing_data$Date <- as.Date(ny_housing_data$Date)

# remove NA values
ny_housing_data_clean <- stats::na.omit(ny_housing_data)

# writing clean data to csv for common usage in future
# commented out for future coding purposes
#readr::write_csv(ny_housing_data_clean, "data/ny_housing_data_clean.csv")
ny_housing_data_clean <- readr::read_csv("data/ny_housing_data_clean.csv")
New names:
Rows: 280 Columns: 15
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," dbl
(14): ...1, mean_sfr_value, median_sale_price, Fed_Interest_Rate, mortg... date
(1): Date
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Code
# plot a reduced lm model
lm_model <- stats::lm(median_sale_price ~ ., data = ny_housing_data_clean)
graphics::plot(lm_model, which = 2)

Code
graphics::hist(lm_model$residuals)

Code
summary(lm_model)

Call:
stats::lm(formula = median_sale_price ~ ., data = ny_housing_data_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-23306.1  -5769.8    435.6   4436.0  24177.8 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -8.365e+05  8.063e+06  -0.104    0.917    
...1                      -2.340e+03  2.197e+04  -0.107    0.915    
Date                       9.751e+01  7.216e+02   0.135    0.893    
mean_sfr_value             6.852e-01  1.711e-02  40.053  < 2e-16 ***
Fed_Interest_Rate          4.915e+03  8.410e+02   5.844 1.49e-08 ***
mortgage_rate_15_year      4.445e+04  6.366e+03   6.983 2.33e-11 ***
mortgage_rate_30_year     -4.859e+04  6.032e+03  -8.056 2.70e-14 ***
ny_median_hh_income        1.786e+01  3.950e+00   4.522 9.24e-06 ***
national_median_hh_income -5.965e-01  6.315e+00  -0.094    0.925    
non_seven_major_felonies  -1.515e+00  3.251e+00  -0.466    0.642    
major_felonies            -4.458e+00  1.121e+00  -3.977 9.00e-05 ***
misdemeanor_offenses      -3.946e+00  5.504e-01  -7.170 7.50e-12 ***
unemployment_rate         -2.781e+03  4.929e+02  -5.642 4.32e-08 ***
housing_crisis             1.520e+04  2.812e+03   5.405 1.45e-07 ***
covid_pandemic             1.707e+04  3.067e+03   5.565 6.43e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8180 on 265 degrees of freedom
Multiple R-squared:  0.9968,    Adjusted R-squared:  0.9966 
F-statistic:  5886 on 14 and 265 DF,  p-value: < 2.2e-16
Code
#heteroskedasticity check
graphics::plot(lm_model$residuals ~ lm_model$fitted.values, 
      main = "Heteroskedastic Residuals", 
      xlab = "Predicted Values", 
      ylab = "Residuals")
graphics::abline(h=0, col="red")

Code
#residuals vs fitted plot
graphics::plot(lm_model, which = 1)

Code
#checking for serial correlation
lmtest::dwtest(lm_model)

    Durbin-Watson test

data:  lm_model
DW = 0.36236, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
Code
#run ggpairs
ggpairs(ny_housing_data_clean)

3 Modeling

3.1 Model Method 1 - OLS

Since we have concluded that our data suffers from serial correlation, we must transform our variables and lag the data. Therefore, we will use logistic regression to build our model. We need to review this statement

3.1.1 Model Characteristics

Variables Included

  • Date
  • mean_sfr_value
  • Fed_Interest_Rate
  • mortgage_rate_15_year
  • mortgage_rate_30_year
  • ny_median_hh_income
  • major_felonies
  • misdemeanor_offenses
  • unemployment_rate
  • housing_crisis
  • covid_pandemic

Variables Excluded

  • national_median_hh_income
  • non_seven_major_felonies
  • Date.L1
  • Date.L3

3.1.2 Initial Findings:

  • Heteroskedasticity - There was evidence of heteroskedasticity in the residuals, as shown by the “Heteroskedastic Residuals” plot, indicating that the error variance was not constant. Serial correlation was detected using the Durbin-Watson test, with positive correlation in the residuals due to the time series nature of the data.

  • Nonnormal distribution of residuals - The residuals deviated from normality, as seen in the Q-Q plot and histogram, which suggested non-normal distribution.

3.1.3 Challenges and Adjustments:

  • Lagging - We tried transforming and lagging date-related variables but found significant correlation persisted. We also adjusted for different lagging periods.
  • Second Model Choice - We will use a Weighted Least Squares (WLS) as an alternative to manage non-constant variance more effectively as our second model method. We will combine this with stepwise regression to further enhance variable selection.
Code
ny_housing_data_clean <- ny_housing_data_clean %>%
  dplyr::arrange(Date) %>%
  dplyr::mutate(
    Date.L1 = Date %m+% months(-1),  # Lag by 1 month
    Date.L3 = Date %m+% months(-3) # Lag by 3 months
  )

#regression with lagged variables
lm_model_lag <- stats::lm(median_sale_price ~ ., data = ny_housing_data_clean)

summary(lm_model_lag)

Call:
stats::lm(formula = median_sale_price ~ ., data = ny_housing_data_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-22319.5  -5379.8    402.6   4440.9  24837.4 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                1.186e+07  1.205e+07   0.984    0.326    
...1                       3.248e+04  3.296e+04   0.985    0.325    
Date                       2.899e+02  7.508e+02   0.386    0.700    
mean_sfr_value             6.805e-01  1.745e-02  38.992  < 2e-16 ***
Fed_Interest_Rate          4.884e+03  8.423e+02   5.798 1.92e-08 ***
mortgage_rate_15_year      4.459e+04  6.398e+03   6.970 2.55e-11 ***
mortgage_rate_30_year     -4.879e+04  6.060e+03  -8.050 2.88e-14 ***
ny_median_hh_income        1.726e+01  3.973e+00   4.346 1.99e-05 ***
national_median_hh_income  2.397e-01  6.374e+00   0.038    0.970    
non_seven_major_felonies  -9.270e-01  3.277e+00  -0.283    0.777    
major_felonies            -4.672e+00  1.136e+00  -4.113 5.23e-05 ***
misdemeanor_offenses      -4.031e+00  5.538e-01  -7.280 3.88e-12 ***
unemployment_rate         -2.812e+03  4.936e+02  -5.697 3.25e-08 ***
housing_crisis             1.540e+04  2.816e+03   5.470 1.05e-07 ***
covid_pandemic             1.778e+04  3.110e+03   5.717 2.94e-08 ***
Date.L1                   -5.992e+02  7.761e+02  -0.772    0.441    
Date.L3                   -7.373e+02  7.550e+02  -0.976    0.330    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8180 on 263 degrees of freedom
Multiple R-squared:  0.9968,    Adjusted R-squared:  0.9966 
F-statistic:  5151 on 16 and 263 DF,  p-value: < 2.2e-16
Code
final_lm_model <- stats::lm(median_sale_price ~ - Date + mean_sfr_value + 
                            Fed_Interest_Rate + mortgage_rate_15_year + mortgage_rate_30_year + ny_median_hh_income + major_felonies + misdemeanor_offenses + unemployment_rate + housing_crisis + covid_pandemic, data = ny_housing_data_clean)

summary(final_lm_model)

Call:
stats::lm(formula = median_sale_price ~ -Date + mean_sfr_value + 
    Fed_Interest_Rate + mortgage_rate_15_year + mortgage_rate_30_year + 
    ny_median_hh_income + major_felonies + misdemeanor_offenses + 
    unemployment_rate + housing_crisis + covid_pandemic, data = ny_housing_data_clean)

Residuals:
   Min     1Q Median     3Q    Max 
-34397 -11673     98  11672  40606 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            9.789e+03  6.145e+04   0.159 0.873564    
mean_sfr_value         8.417e-01  2.708e-02  31.079  < 2e-16 ***
Fed_Interest_Rate      1.217e+04  1.471e+03   8.277 5.93e-15 ***
mortgage_rate_15_year -5.480e+04  9.742e+03  -5.625 4.65e-08 ***
mortgage_rate_30_year  3.877e+04  9.661e+03   4.013 7.79e-05 ***
ny_median_hh_income    5.898e+01  6.055e+00   9.739  < 2e-16 ***
major_felonies        -1.095e+01  1.481e+00  -7.392 1.83e-12 ***
misdemeanor_offenses  -2.649e+00  7.609e-01  -3.482 0.000582 ***
unemployment_rate      6.313e+02  9.046e+02   0.698 0.485825    
housing_crisis         2.237e+04  5.220e+03   4.284 2.56e-05 ***
covid_pandemic         2.805e+04  5.971e+03   4.698 4.20e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16080 on 269 degrees of freedom
Multiple R-squared:  0.9874,    Adjusted R-squared:  0.987 
F-statistic:  2111 on 10 and 269 DF,  p-value: < 2.2e-16
Code
graphics::plot(final_lm_model, which = 2)

Code
graphics::hist(final_lm_model$residuals)

Code
#heteroskedasticity check
graphics::plot(final_lm_model$residuals ~ final_lm_model$fitted.values, 
      main = "Heteroskedastic Residuals", 
      xlab = "Predicted Values", 
      ylab = "Residuals")
graphics::abline(h=0, col="red")

Code
#residuals vs fitted plot
graphics::plot(final_lm_model, which = 1)

Code
lmtest::dwtest(final_lm_model)

    Durbin-Watson test

data:  final_lm_model
DW = 0.24442, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0

3.2 Model 2 - WLS Stepwise Regression

3.2.1 Model Characteristics

Variables Included

  • Date
  • mean_sfr_value
  • Fed_Interest_Rate
  • mortgage_rate_15_year
  • mortgage_rate_30_year
  • ny_median_hh_income
  • major_felonies
  • misdemeanor_offenses
  • unemployment_rate
  • housing_crisis
  • covid_pandemic

Variables Excluded

  • national_median_hh_income
  • non_seven_major_felonies
  • Date.L1
  • Date.L3

3.2.2 Initial Findings

  • Summary of Findings - The WLS model helped moderate the impact of variance differences. However, the persistent issue of serial correlation needed further examination.
  • Stepwise Variable Selection - The stepwise regression confirmed the statistical significance of all variables, indicating that they all contribute to the prediction of median sale price.

3.2.3 Challenges and Adjustments

  • Heteroskedasticity - We will consider using a logarithmic transformation to adjust variables that might be having an imbalanced affect on the model - particularly the yearly data that we distributed monthly.
Code
# calculate WLS model weights
lm_model_wls_weights <- 1 / stats::fitted(stats::lm(abs(stats::residuals(lm_model_lag)) ~ stats::fitted(lm_model_lag)))^2

# fit a WLS model
wls_model <- stats::lm(median_sale_price ~., data = ny_housing_data_clean,
                              weights = lm_model_wls_weights)

summary(wls_model)

Call:
stats::lm(formula = median_sale_price ~ ., data = ny_housing_data_clean, 
    weights = lm_model_wls_weights)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-4.0296 -0.8398 -0.0172  0.9778  4.0086 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                1.479e+07  1.045e+07   1.416 0.157900    
...1                       4.072e+04  2.857e+04   1.425 0.155266    
Date                       2.843e+02  6.515e+02   0.436 0.662909    
mean_sfr_value             6.877e-01  1.536e-02  44.776  < 2e-16 ***
Fed_Interest_Rate          3.273e+03  6.731e+02   4.863 1.99e-06 ***
mortgage_rate_15_year      4.714e+04  5.988e+03   7.873 9.10e-14 ***
mortgage_rate_30_year     -4.929e+04  5.792e+03  -8.511 1.34e-15 ***
ny_median_hh_income        1.450e+01  3.368e+00   4.306 2.34e-05 ***
national_median_hh_income -6.157e-01  6.331e+00  -0.097 0.922603    
non_seven_major_felonies   6.318e+00  2.592e+00   2.438 0.015446 *  
major_felonies            -4.066e+00  1.087e+00  -3.741 0.000225 ***
misdemeanor_offenses      -4.796e+00  4.697e-01 -10.212  < 2e-16 ***
unemployment_rate         -2.136e+03  5.583e+02  -3.825 0.000163 ***
housing_crisis             1.236e+04  2.462e+03   5.021 9.48e-07 ***
covid_pandemic             1.475e+04  3.779e+03   3.904 0.000120 ***
Date.L1                   -2.234e+02  6.724e+02  -0.332 0.739916    
Date.L3                   -1.376e+03  6.568e+02  -2.095 0.037085 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.298 on 263 degrees of freedom
Multiple R-squared:  0.9972,    Adjusted R-squared:  0.997 
F-statistic:  5786 on 16 and 263 DF,  p-value: < 2.2e-16
Code
# fit a null model
wls_model_null <- stats::lm(median_sale_price ~ 1, data = ny_housing_data_clean)

# run a stepwise regression
wls_model_step <- stats::step(wls_model_null, direction = "both", 
                              scope = stats::formula(wls_model))
Start:  AIC=6639.88
median_sale_price ~ 1

                            Df  Sum of Sq        RSS    AIC
+ national_median_hh_income  1 5.0958e+12 4.3588e+11 5930.4
+ mean_sfr_value             1 5.0769e+12 4.5478e+11 5942.3
+ ...1                       1 4.7470e+12 7.8469e+11 6095.1
+ Date.L3                    1 4.7470e+12 7.8472e+11 6095.1
+ Date                       1 4.7469e+12 7.8473e+11 6095.1
+ Date.L1                    1 4.7469e+12 7.8473e+11 6095.1
+ misdemeanor_offenses       1 4.2501e+12 1.2816e+12 6232.4
+ ny_median_hh_income        1 3.9147e+12 1.6170e+12 6297.5
+ non_seven_major_felonies   1 2.7583e+12 2.7734e+12 6448.6
+ covid_pandemic             1 2.2362e+12 3.2955e+12 6496.9
+ major_felonies             1 2.0818e+12 3.4498e+12 6509.7
+ mortgage_rate_30_year      1 1.5220e+12 4.0096e+12 6551.8
+ mortgage_rate_15_year      1 1.4463e+12 4.0854e+12 6557.0
+ unemployment_rate          1 1.8047e+11 5.3512e+12 6632.6
<none>                                    5.5317e+12 6639.9
+ housing_crisis             1 1.8777e+10 5.5129e+12 6640.9
+ Fed_Interest_Rate          1 9.7371e+08 5.5307e+12 6641.8

Step:  AIC=5930.44
median_sale_price ~ national_median_hh_income

                            Df  Sum of Sq        RSS    AIC
+ mean_sfr_value             1 2.9272e+11 1.4316e+11 5620.7
+ major_felonies             1 7.9303e+10 3.5658e+11 5876.2
+ misdemeanor_offenses       1 5.1189e+10 3.8469e+11 5897.5
+ ny_median_hh_income        1 2.6600e+10 4.0928e+11 5914.8
+ housing_crisis             1 2.6003e+10 4.0988e+11 5915.2
+ mortgage_rate_30_year      1 1.1958e+10 4.2392e+11 5924.6
+ non_seven_major_felonies   1 9.6041e+09 4.2628e+11 5926.2
+ ...1                       1 7.9685e+09 4.2791e+11 5927.3
+ Date                       1 7.9673e+09 4.2791e+11 5927.3
+ Date.L1                    1 7.9658e+09 4.2792e+11 5927.3
+ Date.L3                    1 7.9631e+09 4.2792e+11 5927.3
+ mortgage_rate_15_year      1 5.9463e+09 4.2994e+11 5928.6
+ unemployment_rate          1 5.2846e+09 4.3060e+11 5929.0
<none>                                    4.3588e+11 5930.4
+ Fed_Interest_Rate          1 2.8257e+09 4.3306e+11 5930.6
+ covid_pandemic             1 2.3797e+09 4.3350e+11 5930.9
- national_median_hh_income  1 5.0958e+12 5.5317e+12 6639.9

Step:  AIC=5620.69
median_sale_price ~ national_median_hh_income + mean_sfr_value

                            Df  Sum of Sq        RSS    AIC
+ misdemeanor_offenses       1 5.8593e+10 8.4569e+10 5475.3
+ major_felonies             1 5.1411e+10 9.1752e+10 5498.1
+ non_seven_major_felonies   1 4.3444e+10 9.9718e+10 5521.4
+ mortgage_rate_30_year      1 3.7212e+10 1.0595e+11 5538.4
+ Date                       1 3.1488e+10 1.1167e+11 5553.1
+ Date.L1                    1 3.1487e+10 1.1167e+11 5553.1
+ Date.L3                    1 3.1485e+10 1.1168e+11 5553.1
+ ...1                       1 3.1482e+10 1.1168e+11 5553.2
+ mortgage_rate_15_year      1 3.1070e+10 1.1209e+11 5554.2
+ ny_median_hh_income        1 2.5615e+10 1.1755e+11 5567.5
+ Fed_Interest_Rate          1 1.0991e+10 1.3217e+11 5600.3
<none>                                    1.4316e+11 5620.7
+ covid_pandemic             1 3.4872e+08 1.4281e+11 5622.0
+ housing_crisis             1 1.6508e+08 1.4300e+11 5622.4
+ unemployment_rate          1 8.0250e+06 1.4315e+11 5622.7
- mean_sfr_value             1 2.9272e+11 4.3588e+11 5930.4
- national_median_hh_income  1 3.1161e+11 4.5478e+11 5942.3

Step:  AIC=5475.29
median_sale_price ~ national_median_hh_income + mean_sfr_value + 
    misdemeanor_offenses

                            Df  Sum of Sq        RSS    AIC
+ major_felonies             1 3.3975e+10 5.0594e+10 5333.4
+ Date                       1 2.2898e+10 6.1670e+10 5388.9
+ Date.L1                    1 2.2898e+10 6.1671e+10 5388.9
+ ...1                       1 2.2896e+10 6.1673e+10 5388.9
+ Date.L3                    1 2.2896e+10 6.1673e+10 5388.9
+ mortgage_rate_30_year      1 1.3496e+10 7.1073e+10 5428.6
+ mortgage_rate_15_year      1 1.0485e+10 7.4084e+10 5440.2
+ covid_pandemic             1 6.9576e+09 7.7611e+10 5453.3
+ non_seven_major_felonies   1 6.8407e+09 7.7728e+10 5453.7
+ housing_crisis             1 4.8407e+09 7.9728e+10 5460.8
+ Fed_Interest_Rate          1 1.7324e+09 8.2836e+10 5471.5
<none>                                    8.4569e+10 5475.3
+ ny_median_hh_income        1 4.8949e+08 8.4079e+10 5475.7
+ unemployment_rate          1 2.4636e+08 8.4322e+10 5476.5
- misdemeanor_offenses       1 5.8593e+10 1.4316e+11 5620.7
- national_median_hh_income  1 7.9120e+10 1.6369e+11 5658.2
- mean_sfr_value             1 3.0012e+11 3.8469e+11 5897.5

Step:  AIC=5333.45
median_sale_price ~ national_median_hh_income + mean_sfr_value + 
    misdemeanor_offenses + major_felonies

                            Df  Sum of Sq        RSS    AIC
+ Fed_Interest_Rate          1 1.0475e+10 4.0119e+10 5270.5
+ unemployment_rate          1 8.9617e+09 4.1632e+10 5280.9
+ ny_median_hh_income        1 7.2417e+09 4.3352e+10 5292.2
+ non_seven_major_felonies   1 6.6369e+09 4.3957e+10 5296.1
+ housing_crisis             1 1.4700e+09 4.9124e+10 5327.2
+ mortgage_rate_15_year      1 1.4055e+09 4.9188e+10 5327.6
+ mortgage_rate_30_year      1 4.5200e+08 5.0142e+10 5332.9
+ Date.L3                    1 3.9918e+08 5.0195e+10 5333.2
+ ...1                       1 3.9848e+08 5.0195e+10 5333.2
+ Date.L1                    1 3.9788e+08 5.0196e+10 5333.2
+ Date                       1 3.9756e+08 5.0196e+10 5333.2
<none>                                    5.0594e+10 5333.4
+ covid_pandemic             1 9.7254e+07 5.0496e+10 5334.9
- major_felonies             1 3.3975e+10 8.4569e+10 5475.3
- misdemeanor_offenses       1 4.1158e+10 9.1752e+10 5498.1
- national_median_hh_income  1 7.5271e+10 1.2586e+11 5586.6
- mean_sfr_value             1 2.7457e+11 3.2516e+11 5852.4

Step:  AIC=5270.49
median_sale_price ~ national_median_hh_income + mean_sfr_value + 
    misdemeanor_offenses + major_felonies + Fed_Interest_Rate

                            Df  Sum of Sq        RSS    AIC
+ mortgage_rate_30_year      1 2.9755e+09 3.7143e+10 5250.9
+ unemployment_rate          1 2.5529e+09 3.7566e+10 5254.1
+ housing_crisis             1 2.4122e+09 3.7707e+10 5255.1
+ covid_pandemic             1 2.3009e+09 3.7818e+10 5256.0
+ mortgage_rate_15_year      1 1.7205e+09 3.8398e+10 5260.2
+ ny_median_hh_income        1 1.5620e+09 3.8557e+10 5261.4
+ Date                       1 1.3582e+09 3.8761e+10 5262.8
+ Date.L1                    1 1.3578e+09 3.8761e+10 5262.9
+ Date.L3                    1 1.3560e+09 3.8763e+10 5262.9
+ ...1                       1 1.3557e+09 3.8763e+10 5262.9
+ non_seven_major_felonies   1 4.8187e+08 3.9637e+10 5269.1
<none>                                    4.0119e+10 5270.5
- Fed_Interest_Rate          1 1.0475e+10 5.0594e+10 5333.4
- major_felonies             1 4.2717e+10 8.2836e+10 5471.5
- misdemeanor_offenses       1 4.7888e+10 8.8007e+10 5488.4
- national_median_hh_income  1 7.7250e+10 1.1737e+11 5569.1
- mean_sfr_value             1 1.8211e+11 2.2223e+11 5747.8

Step:  AIC=5250.91
median_sale_price ~ national_median_hh_income + mean_sfr_value + 
    misdemeanor_offenses + major_felonies + Fed_Interest_Rate + 
    mortgage_rate_30_year

                            Df  Sum of Sq        RSS    AIC
+ housing_crisis             1 6.1815e+09 3.0962e+10 5201.9
+ ny_median_hh_income        1 4.7305e+09 3.2413e+10 5214.8
+ mortgage_rate_15_year      1 4.0183e+09 3.3125e+10 5220.9
+ unemployment_rate          1 2.1549e+09 3.4989e+10 5236.2
+ covid_pandemic             1 1.4601e+09 3.5683e+10 5241.7
+ non_seven_major_felonies   1 1.4353e+09 3.5708e+10 5241.9
<none>                                    3.7143e+10 5250.9
+ Date                       1 9.4972e+07 3.7048e+10 5252.2
+ Date.L1                    1 9.4826e+07 3.7049e+10 5252.2
+ ...1                       1 9.4307e+07 3.7049e+10 5252.2
+ Date.L3                    1 9.4209e+07 3.7049e+10 5252.2
- mortgage_rate_30_year      1 2.9755e+09 4.0119e+10 5270.5
- Fed_Interest_Rate          1 1.2998e+10 5.0142e+10 5332.9
- major_felonies             1 2.6920e+10 6.4064e+10 5401.5
- misdemeanor_offenses       1 3.9465e+10 7.6609e+10 5451.6
- national_median_hh_income  1 7.2922e+10 1.1007e+11 5553.1
- mean_sfr_value             1 1.8324e+11 2.2038e+11 5747.5

Step:  AIC=5201.95
median_sale_price ~ national_median_hh_income + mean_sfr_value + 
    misdemeanor_offenses + major_felonies + Fed_Interest_Rate + 
    mortgage_rate_30_year + housing_crisis

                            Df  Sum of Sq        RSS    AIC
+ ny_median_hh_income        1 3.7029e+09 2.7259e+10 5168.3
+ unemployment_rate          1 1.4248e+09 2.9537e+10 5190.8
+ mortgage_rate_15_year      1 1.4015e+09 2.9560e+10 5191.0
+ Date                       1 1.0129e+09 2.9949e+10 5194.6
+ ...1                       1 1.0126e+09 2.9949e+10 5194.6
+ Date.L1                    1 1.0124e+09 2.9950e+10 5194.6
+ Date.L3                    1 1.0108e+09 2.9951e+10 5194.7
+ covid_pandemic             1 9.0315e+08 3.0059e+10 5195.7
+ non_seven_major_felonies   1 2.2315e+08 3.0739e+10 5201.9
<none>                                    3.0962e+10 5201.9
- housing_crisis             1 6.1815e+09 3.7143e+10 5250.9
- mortgage_rate_30_year      1 6.7447e+09 3.7707e+10 5255.1
- Fed_Interest_Rate          1 1.8022e+10 4.8984e+10 5328.4
- major_felonies             1 1.9364e+10 5.0326e+10 5336.0
- misdemeanor_offenses       1 4.4004e+10 7.4965e+10 5447.5
- national_median_hh_income  1 7.5839e+10 1.0680e+11 5546.6
- mean_sfr_value             1 1.4933e+11 1.8029e+11 5693.3

Step:  AIC=5168.28
median_sale_price ~ national_median_hh_income + mean_sfr_value + 
    misdemeanor_offenses + major_felonies + Fed_Interest_Rate + 
    mortgage_rate_30_year + housing_crisis + ny_median_hh_income

                            Df  Sum of Sq        RSS    AIC
+ Date                       1 3.0548e+09 2.4204e+10 5137.0
+ Date.L1                    1 3.0540e+09 2.4205e+10 5137.0
+ ...1                       1 3.0535e+09 2.4206e+10 5137.0
+ Date.L3                    1 3.0519e+09 2.4207e+10 5137.0
+ covid_pandemic             1 1.5387e+09 2.5720e+10 5154.0
+ mortgage_rate_15_year      1 5.7199e+08 2.6687e+10 5164.3
+ non_seven_major_felonies   1 3.2162e+08 2.6937e+10 5167.0
<none>                                    2.7259e+10 5168.3
+ unemployment_rate          1 1.8040e+08 2.7079e+10 5168.4
- ny_median_hh_income        1 3.7029e+09 3.0962e+10 5201.9
- housing_crisis             1 5.1538e+09 3.2413e+10 5214.8
- misdemeanor_offenses       1 7.3789e+09 3.4638e+10 5233.4
- mortgage_rate_30_year      1 9.6803e+09 3.6939e+10 5251.4
- Fed_Interest_Rate          1 1.3765e+10 4.1024e+10 5280.7
- major_felonies             1 1.4687e+10 4.1946e+10 5287.0
- national_median_hh_income  1 5.5375e+10 8.2634e+10 5476.8
- mean_sfr_value             1 1.5220e+11 1.7946e+11 5694.0

Step:  AIC=5137
median_sale_price ~ national_median_hh_income + mean_sfr_value + 
    misdemeanor_offenses + major_felonies + Fed_Interest_Rate + 
    mortgage_rate_30_year + housing_crisis + ny_median_hh_income + 
    Date

                            Df  Sum of Sq        RSS    AIC
+ mortgage_rate_15_year      1 3.4905e+09 2.0714e+10 5095.4
+ covid_pandemic             1 2.4160e+09 2.1788e+10 5109.6
+ non_seven_major_felonies   1 2.7520e+08 2.3929e+10 5135.8
<none>                                    2.4204e+10 5137.0
- national_median_hh_income  1 2.3760e+08 2.4442e+10 5137.7
+ Date.L3                    1 4.2503e+07 2.4162e+10 5138.5
+ unemployment_rate          1 2.8017e+07 2.4176e+10 5138.7
+ ...1                       1 1.4914e+07 2.4189e+10 5138.8
+ Date.L1                    1 4.0480e+06 2.4200e+10 5139.0
- major_felonies             1 1.1987e+09 2.5403e+10 5148.5
- Date                       1 3.0548e+09 2.7259e+10 5168.3
- mortgage_rate_30_year      1 5.7144e+09 2.9919e+10 5194.3
- ny_median_hh_income        1 5.7447e+09 2.9949e+10 5194.6
- housing_crisis             1 6.9211e+09 3.1125e+10 5205.4
- misdemeanor_offenses       1 7.8615e+09 3.2066e+10 5213.8
- Fed_Interest_Rate          1 1.5652e+10 3.9856e+10 5274.7
- mean_sfr_value             1 1.5046e+11 1.7467e+11 5688.4

Step:  AIC=5095.4
median_sale_price ~ national_median_hh_income + mean_sfr_value + 
    misdemeanor_offenses + major_felonies + Fed_Interest_Rate + 
    mortgage_rate_30_year + housing_crisis + ny_median_hh_income + 
    Date + mortgage_rate_15_year

                            Df  Sum of Sq        RSS    AIC
+ unemployment_rate          1 8.7708e+08 1.9837e+10 5085.3
+ covid_pandemic             1 8.2284e+08 1.9891e+10 5086.0
- major_felonies             1 4.4378e+07 2.0758e+10 5094.0
- national_median_hh_income  1 6.6672e+07 2.0781e+10 5094.3
<none>                                    2.0714e+10 5095.4
+ Date.L3                    1 9.1094e+06 2.0705e+10 5097.3
+ Date.L1                    1 9.0131e+06 2.0705e+10 5097.3
+ non_seven_major_felonies   1 2.2602e+06 2.0712e+10 5097.4
+ ...1                       1 1.2390e+05 2.0714e+10 5097.4
- mortgage_rate_15_year      1 3.4905e+09 2.4204e+10 5137.0
- housing_crisis             1 4.1653e+09 2.4879e+10 5144.7
- ny_median_hh_income        1 5.1112e+09 2.5825e+10 5155.1
- mortgage_rate_30_year      1 5.3510e+09 2.6065e+10 5157.7
- Date                       1 5.9732e+09 2.6687e+10 5164.3
- Fed_Interest_Rate          1 6.9233e+09 2.7637e+10 5174.1
- misdemeanor_offenses       1 8.3102e+09 2.9024e+10 5187.8
- mean_sfr_value             1 1.5389e+11 1.7460e+11 5690.3

Step:  AIC=5085.28
median_sale_price ~ national_median_hh_income + mean_sfr_value + 
    misdemeanor_offenses + major_felonies + Fed_Interest_Rate + 
    mortgage_rate_30_year + housing_crisis + ny_median_hh_income + 
    Date + mortgage_rate_15_year + unemployment_rate

                            Df  Sum of Sq        RSS    AIC
+ covid_pandemic             1 2.0904e+09 1.7746e+10 5056.1
- national_median_hh_income  1 8.7451e+06 1.9846e+10 5083.4
- major_felonies             1 1.2556e+08 1.9962e+10 5085.1
<none>                                    1.9837e+10 5085.3
+ non_seven_major_felonies   1 3.0180e+07 1.9807e+10 5086.9
+ Date.L1                    1 1.9329e+07 1.9817e+10 5087.0
+ Date.L3                    1 8.4571e+06 1.9828e+10 5087.2
+ ...1                       1 9.9219e+05 1.9836e+10 5087.3
- unemployment_rate          1 8.7708e+08 2.0714e+10 5095.4
- ny_median_hh_income        1 1.6594e+09 2.1496e+10 5105.8
- Fed_Interest_Rate          1 2.0571e+09 2.1894e+10 5110.9
- housing_crisis             1 2.6254e+09 2.2462e+10 5118.1
- mortgage_rate_15_year      1 4.3395e+09 2.4176e+10 5138.7
- Date                       1 5.6883e+09 2.5525e+10 5153.9
- mortgage_rate_30_year      1 6.1393e+09 2.5976e+10 5158.8
- misdemeanor_offenses       1 9.1801e+09 2.9017e+10 5189.8
- mean_sfr_value             1 1.5119e+11 1.7102e+11 5686.5

Step:  AIC=5056.1
median_sale_price ~ national_median_hh_income + mean_sfr_value + 
    misdemeanor_offenses + major_felonies + Fed_Interest_Rate + 
    mortgage_rate_30_year + housing_crisis + ny_median_hh_income + 
    Date + mortgage_rate_15_year + unemployment_rate + covid_pandemic

                            Df  Sum of Sq        RSS    AIC
- national_median_hh_income  1 1.1519e+06 1.7748e+10 5054.1
<none>                                    1.7746e+10 5056.1
+ Date.L3                    1 5.6561e+07 1.7690e+10 5057.2
+ Date.L1                    1 5.2264e+07 1.7694e+10 5057.3
+ non_seven_major_felonies   1 1.3916e+07 1.7732e+10 5057.9
+ ...1                       1 1.4064e+05 1.7746e+10 5058.1
- major_felonies             1 1.1648e+09 1.8911e+10 5071.9
- ny_median_hh_income        1 1.4375e+09 1.9184e+10 5075.9
- housing_crisis             1 2.0375e+09 1.9784e+10 5084.5
- covid_pandemic             1 2.0904e+09 1.9837e+10 5085.3
- unemployment_rate          1 2.1446e+09 1.9891e+10 5086.0
- Fed_Interest_Rate          1 2.4270e+09 2.0173e+10 5090.0
- mortgage_rate_15_year      1 3.4191e+09 2.1166e+10 5103.4
- mortgage_rate_30_year      1 4.5321e+09 2.2278e+10 5117.8
- Date                       1 5.1246e+09 2.2871e+10 5125.1
- misdemeanor_offenses       1 7.4705e+09 2.5217e+10 5152.5
- mean_sfr_value             1 1.1034e+11 1.2809e+11 5607.5

Step:  AIC=5054.12
median_sale_price ~ mean_sfr_value + misdemeanor_offenses + major_felonies + 
    Fed_Interest_Rate + mortgage_rate_30_year + housing_crisis + 
    ny_median_hh_income + Date + mortgage_rate_15_year + unemployment_rate + 
    covid_pandemic

                            Df  Sum of Sq        RSS    AIC
<none>                                    1.7748e+10 5054.1
+ Date.L3                    1 5.7710e+07 1.7690e+10 5055.2
+ Date.L1                    1 5.2721e+07 1.7695e+10 5055.3
+ non_seven_major_felonies   1 1.4390e+07 1.7733e+10 5055.9
+ national_median_hh_income  1 1.1519e+06 1.7746e+10 5056.1
+ ...1                       1 1.8170e+05 1.7747e+10 5056.1
- ny_median_hh_income        1 1.6558e+09 1.9403e+10 5077.1
- housing_crisis             1 2.0627e+09 1.9810e+10 5082.9
- covid_pandemic             1 2.0980e+09 1.9846e+10 5083.4
- unemployment_rate          1 2.2283e+09 1.9976e+10 5085.2
- major_felonies             1 2.3529e+09 2.0100e+10 5087.0
- Fed_Interest_Rate          1 2.4297e+09 2.0177e+10 5088.0
- mortgage_rate_15_year      1 3.5864e+09 2.1334e+10 5103.7
- mortgage_rate_30_year      1 4.6922e+09 2.2440e+10 5117.8
- misdemeanor_offenses       1 7.4694e+09 2.5217e+10 5150.5
- Date                       1 5.1841e+10 6.9589e+10 5434.7
- mean_sfr_value             1 1.4032e+11 1.5807e+11 5664.4
Code
summary(wls_model_step)

Call:
stats::lm(formula = median_sale_price ~ mean_sfr_value + misdemeanor_offenses + 
    major_felonies + Fed_Interest_Rate + mortgage_rate_30_year + 
    housing_crisis + ny_median_hh_income + Date + mortgage_rate_15_year + 
    unemployment_rate + covid_pandemic, data = ny_housing_data_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-23363.5  -5683.0    507.9   4358.9  23964.2 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            2.602e+04  3.110e+04   0.837    0.403    
mean_sfr_value         6.828e-01  1.483e-02  46.032  < 2e-16 ***
misdemeanor_offenses  -4.127e+00  3.886e-01 -10.620  < 2e-16 ***
major_felonies        -4.664e+00  7.825e-01  -5.961 7.88e-09 ***
Fed_Interest_Rate      4.782e+03  7.895e+02   6.057 4.67e-09 ***
mortgage_rate_30_year -4.887e+04  5.805e+03  -8.418 2.34e-15 ***
housing_crisis         1.482e+04  2.655e+03   5.581 5.85e-08 ***
ny_median_hh_income    1.705e+01  3.410e+00   5.000 1.04e-06 ***
Date                   2.056e+01  7.350e-01  27.979  < 2e-16 ***
mortgage_rate_15_year  4.473e+04  6.079e+03   7.359 2.27e-12 ***
unemployment_rate     -2.746e+03  4.733e+02  -5.801 1.86e-08 ***
covid_pandemic         1.714e+04  3.046e+03   5.629 4.58e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8138 on 268 degrees of freedom
Multiple R-squared:  0.9968,    Adjusted R-squared:  0.9967 
F-statistic:  7569 on 11 and 268 DF,  p-value: < 2.2e-16
Code
graphics::plot(wls_model_step, which = 2)

Code
graphics::hist(wls_model_step$residuals)

Code
#heteroskedasticity check
graphics::plot(wls_model_step$residuals ~ wls_model_step$fitted.values, 
      main = "Heteroskedastic Residuals", 
      xlab = "Predicted Values", 
      ylab = "Residuals")
graphics::abline(h=0, col="red")

Code
#residuals vs fitted plot
graphics::plot(wls_model_step, which = 1)

Code
lmtest::dwtest(wls_model_step)

    Durbin-Watson test

data:  wls_model_step
DW = 0.35949, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0

3.3 Model 3 - Bootstrap

3.3.1 Model Characteristics

Variables Included

  • Date
  • mean_sfr_value
  • Fed_Interest_Rate
  • mortgage_rate_15_year
  • mortgage_rate_30_year
  • log(ny_median_hh_income)
  • log(major_felonies)
  • log(misdemeanor_offenses)
  • unemployment_rate
  • housing_crisis
  • covid_pandemic

Variables Excluded

  • national_median_hh_income
  • non_seven_major_felonies

3.3.2 Initial Findings

  • Summary of Findings - After we performed log transformations on our variables, we found an increase in \(r^2_{Adjusted}\) to 99.66 and a decrease in the standard errors of our coefficients. This suggests that the transformed variables are better suited for modeling. However, this gives us a high likelihood of overfitting our model now. To address this, we will perform a bootstrap model to estimate the standard errors and confidence intervals of our coefficients.

3.3.3 Challenges and Adjustments

  • Overfitting - a likely problem, we computed confidence intervals and standard errors and plotted them. None of the individual predictor CIs cross over the 0 line - meaning we have confidence that they are not ambiguous and can be accurately applied to this model going forward.
Code
# fit linear model with log(<yearly variables>)
bootstrap_model <- stats::lm(median_sale_price ~ Date + mean_sfr_value + Fed_Interest_Rate + mortgage_rate_15_year + 
                  mortgage_rate_30_year + unemployment_rate + housing_crisis + covid_pandemic + log(ny_median_hh_income) + 
                  log(non_seven_major_felonies) + log(major_felonies) + log(misdemeanor_offenses), 
                  data = ny_housing_data_clean)
# view summary statistics                  
summary(bootstrap_model)

Call:
stats::lm(formula = median_sale_price ~ Date + mean_sfr_value + 
    Fed_Interest_Rate + mortgage_rate_15_year + mortgage_rate_30_year + 
    unemployment_rate + housing_crisis + covid_pandemic + log(ny_median_hh_income) + 
    log(non_seven_major_felonies) + log(major_felonies) + log(misdemeanor_offenses), 
    data = ny_housing_data_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-23546.7  -5620.8    135.2   4530.2  26103.2 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    5.853e+05  2.927e+05   2.000   0.0465 *  
Date                           2.020e+01  7.848e-01  25.738  < 2e-16 ***
mean_sfr_value                 7.130e-01  1.447e-02  49.258  < 2e-16 ***
Fed_Interest_Rate              4.481e+03  8.357e+02   5.362 1.78e-07 ***
mortgage_rate_15_year          3.917e+04  6.033e+03   6.493 4.08e-10 ***
mortgage_rate_30_year         -4.305e+04  5.809e+03  -7.411 1.65e-12 ***
unemployment_rate             -3.192e+03  4.975e+02  -6.416 6.34e-10 ***
housing_crisis                 1.370e+04  2.790e+03   4.913 1.57e-06 ***
covid_pandemic                 1.493e+04  3.093e+03   4.826 2.35e-06 ***
log(ny_median_hh_income)       9.982e+04  2.286e+04   4.367 1.80e-05 ***
log(non_seven_major_felonies) -8.089e+03  1.710e+04  -0.473   0.6366    
log(major_felonies)           -4.773e+04  8.952e+03  -5.332 2.07e-07 ***
log(misdemeanor_offenses)     -9.661e+04  1.532e+04  -6.307 1.17e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8200 on 267 degrees of freedom
Multiple R-squared:  0.9968,    Adjusted R-squared:  0.9966 
F-statistic:  6833 on 12 and 267 DF,  p-value: < 2.2e-16
Code
# fit reduced model
reduced_bootstrap_model <- stats::lm(median_sale_price ~ Date + mean_sfr_value + Fed_Interest_Rate + mortgage_rate_15_year + 
                                    mortgage_rate_30_year + unemployment_rate + housing_crisis + covid_pandemic + 
                                    log(ny_median_hh_income) + log(major_felonies) + log(misdemeanor_offenses), 
                                    data = ny_housing_data_clean)
# view summary statistics  
summary(reduced_bootstrap_model)   

Call:
stats::lm(formula = median_sale_price ~ Date + mean_sfr_value + 
    Fed_Interest_Rate + mortgage_rate_15_year + mortgage_rate_30_year + 
    unemployment_rate + housing_crisis + covid_pandemic + log(ny_median_hh_income) + 
    log(major_felonies) + log(misdemeanor_offenses), data = ny_housing_data_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-23614.1  -5599.5    126.6   4643.6  25987.5 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                6.326e+05  2.746e+05   2.304    0.022 *  
Date                       2.033e+01  7.319e-01  27.778  < 2e-16 ***
mean_sfr_value             7.114e-01  1.405e-02  50.626  < 2e-16 ***
Fed_Interest_Rate          4.361e+03  7.953e+02   5.484 9.62e-08 ***
mortgage_rate_15_year      3.950e+04  5.983e+03   6.603 2.15e-10 ***
mortgage_rate_30_year     -4.333e+04  5.770e+03  -7.511 8.78e-13 ***
unemployment_rate         -3.156e+03  4.910e+02  -6.428 5.88e-10 ***
housing_crisis             1.334e+04  2.675e+03   4.985 1.11e-06 ***
covid_pandemic             1.503e+04  3.081e+03   4.879 1.83e-06 ***
log(ny_median_hh_income)   9.476e+04  2.016e+04   4.700 4.17e-06 ***
log(major_felonies)       -4.941e+04  8.210e+03  -6.018 5.78e-09 ***
log(misdemeanor_offenses) -1.023e+05  9.485e+03 -10.784  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8188 on 268 degrees of freedom
Multiple R-squared:  0.9968,    Adjusted R-squared:  0.9966 
F-statistic:  7476 on 11 and 268 DF,  p-value: < 2.2e-16
Code
graphics::plot(reduced_bootstrap_model, which = 1)

Code
lmtest::dwtest(reduced_bootstrap_model)

    Durbin-Watson test

data:  reduced_bootstrap_model
DW = 0.35135, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
Code
# convert model into function
reduced_bootstrap_model_fit <- function(data){
  stats::lm(median_sale_price ~ Date + mean_sfr_value + Fed_Interest_Rate + mortgage_rate_15_year + 
            mortgage_rate_30_year + unemployment_rate + housing_crisis + covid_pandemic + log(ny_median_hh_income) + 
            log(major_felonies) + log(misdemeanor_offenses), data = data)
}
# create bootstrap function
bootstrap_function <- function(data, indices) {
  resampled_data <- data[indices, ] # resample the data
  model <- reduced_bootstrap_model_fit(resampled_data) # fit the linear model on the resampled data
  return(coef(model)) 
}
Code
# set seed for reproducibility
bootstrap_seed <- 45
set.seed(bootstrap_seed)

# bootstrapping
bootstrap_results <- boot::boot(data = ny_housing_data_clean , statistic = bootstrap_function, R = 1000)

summary(bootstrap_results) # view summary statistics
          Length Class  Mode     
t0           12  -none- numeric  
t         12000  -none- numeric  
R             1  -none- numeric  
data         17  tbl_df list     
seed        626  -none- numeric  
statistic     1  -none- function 
sim           1  -none- character
call          4  -none- call     
stype         1  -none- character
strata      280  -none- numeric  
weights     280  -none- numeric  
Code
# Calculate CIs for each coefficient
for (i in 1:length(bootstrap_results$t0)) {
    ci <- boot.ci(bootstrap_results, type = "perc", index = i)
    print(ci)
}
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = bootstrap_results, type = "perc", index = i)

Intervals : 
Level     Percentile     
95%   (  52937, 1385222 )  
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = bootstrap_results, type = "perc", index = i)

Intervals : 
Level     Percentile     
95%   (18.93, 22.48 )  
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = bootstrap_results, type = "perc", index = i)

Intervals : 
Level     Percentile     
95%   ( 0.6752,  0.7418 )  
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = bootstrap_results, type = "perc", index = i)

Intervals : 
Level     Percentile     
95%   (1823, 6199 )  
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = bootstrap_results, type = "perc", index = i)

Intervals : 
Level     Percentile     
95%   (27609, 54316 )  
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = bootstrap_results, type = "perc", index = i)

Intervals : 
Level     Percentile     
95%   (-56806, -31447 )  
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = bootstrap_results, type = "perc", index = i)

Intervals : 
Level     Percentile     
95%   (-5999, -1587 )  
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = bootstrap_results, type = "perc", index = i)

Intervals : 
Level     Percentile     
95%   ( 8562, 17141 )  
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = bootstrap_results, type = "perc", index = i)

Intervals : 
Level     Percentile     
95%   ( 4496, 22918 )  
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = bootstrap_results, type = "perc", index = i)

Intervals : 
Level     Percentile     
95%   ( 40255, 134748 )  
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = bootstrap_results, type = "perc", index = i)

Intervals : 
Level     Percentile     
95%   (-73908, -24608 )  
Calculations and Intervals on Original Scale
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = bootstrap_results, type = "perc", index = i)

Intervals : 
Level     Percentile     
95%   (-122227,  -85911 )  
Calculations and Intervals on Original Scale
Code
# Initialize a data frame to store confidence interval results
ci_data <- data.frame(
    Parameter = names(bootstrap_results$t0),  
    Estimate = bootstrap_results$t0,          # point estimates from original data
    Lower = numeric(length(bootstrap_results$t0)),  # store lower bounds
    Upper = numeric(length(bootstrap_results$t0))   # store upper bounds
)

# calculate confidence intervals for each parameter
for (i in 1:length(bootstrap_results$t0)) {
    ci <- boot::boot.ci(bootstrap_results, type = "perc", index = i)$percent
    ci_data$Lower[i] <- ci[4]  # lower bound of the confidence interval
    ci_data$Upper[i] <- ci[5]  # upper bound of the confidence interval
}

# create the plot
ggplot2::ggplot(ci_data, ggplot2::aes(x = Parameter, y = Estimate)) +
    ggplot2::geom_point(size = 3) +  # plot point estimates
    ggplot2::geom_errorbar(ggplot2::aes(ymin = Lower, ymax = Upper), width = 0.2) +  # plot CIs
    ggplot2::theme_minimal() +
    ggplot2::coord_flip() +
    ggplot2::labs(
        title = "Confidence Intervals for Model Parameters",
        x = "Parameter",
        y = "Estimate"
    ) +
    ggplot2::geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
    ggplot2::scale_y_continuous(labels = scales::comma) +
    ggthemes::theme_economist()

Code
# pull predictors and response variable
predictors <- cbind(
  Date = ny_housing_data_clean$Date,
  mean_sfr_value = ny_housing_data_clean$mean_sfr_value,
  Fed_Interest_Rate = ny_housing_data_clean$Fed_Interest_Rate,
  mortgage_rate_15_year = ny_housing_data_clean$mortgage_rate_15_year,
  mortgage_rate_30_year = ny_housing_data_clean$mortgage_rate_30_year,
  unemployment_rate = ny_housing_data_clean$unemployment_rate,
  housing_crisis = ny_housing_data_clean$housing_crisis,
  covid_pandemic = ny_housing_data_clean$covid_pandemic,
  log_ny_median_hh_income = log(ny_housing_data_clean$ny_median_hh_income),
  log_major_felonies = log(ny_housing_data_clean$major_felonies),
  log_misdemeanor_offenses = log(ny_housing_data_clean$misdemeanor_offenses)
)

response_variable <- ny_housing_data_clean$median_sale_price

# set CV params
cv_params <- caret::trainControl(
  method = "cv",
  number = 10
)

cv_model <- caret::train(
  x = predictors,
  y = response_variable,
  method = "lm",
  trControl = cv_params
)

cv_model
Linear Regression 

280 samples
 11 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 252, 252, 252, 252, 252, 252, ... 
Resampling results:

  RMSE      Rsquared   MAE    
  8546.728  0.9968758  6650.75

Tuning parameter 'intercept' was held constant at a value of TRUE
Code
# model summaries
reduced_bootstrap_model_summary <- summary(reduced_bootstrap_model)
final_lm_model_summary <- summary(final_lm_model)
wls_model_step_summary <- summary(wls_model_step)

# extract metrics
# calculate RMSE
reduced_bootstrap_model_rmse <- round(sqrt(mean(reduced_bootstrap_model_summary$residuals^2)),4)
final_lm_model_rmse <- round(sqrt(mean(final_lm_model_summary$residuals^2)),4)
wls_model_step_rmse <- round(sqrt(mean(wls_model_step_summary$residuals^2)),4)
cv_rmse <- round(cv_model$results$RMSE,4)

# extract adj r^2 and dw score
reduced_bootstrap_model_adjr2 <- round(reduced_bootstrap_model_summary$adj.r.squared,4)
reduced_bootstrap_model_dw <- round(as.numeric(lmtest::dwtest(reduced_bootstrap_model)$statistic),4)
final_lm_model_adjr2 <- round(final_lm_model_summary$adj.r.squared, 4)
final_lm_model_dw <- round(as.numeric(lmtest::dwtest(final_lm_model)$statistic),4)
wls_model_step_adjr2 <- round(wls_model_step_summary$adj.r.squared,4)
wls_model_step_dw <- round(as.numeric(lmtest::dwtest(wls_model_step)$statistic), 4)

# create table
model_comparison_table <- data.frame(
  Model = c("OLS", "Stepwise WLS", "Bootstrap", "Bootstrap (CV)"),
  RMSE = c(final_lm_model_rmse, wls_model_step_rmse, reduced_bootstrap_model_rmse, cv_rmse),
  Adjusted_R2 = c(final_lm_model_adjr2, final_lm_model_adjr2, reduced_bootstrap_model_adjr2, NA),
  Durbin_Watson = c(final_lm_model_dw, wls_model_step_dw, reduced_bootstrap_model_dw, NA)
)

model_comparison_table

4 Conclusion

Based on the comprehensive analysis presented in the “ITEC 621 Project,” which centered on predicting median housing prices in New York City, several key insights and challenges emerged across the various modeling approaches used. The project utilized a substantial dataset compiled from reputable sources such as Zillow, the Federal Reserve, and the New York Police Department, integrating economic indicators like interest rates, housing market values, crime statistics, and socio-economic data, to explore their potential impacts on housing prices.

4.1 Model Assessments

The project applied multiple regression analysis methods to delve deeply into underlying patterns. The Ordinary Least Squares (OLS) model highlighted persistent issues of heteroskedasticity, non-normality in residuals, and serial correlation among residuals. This initial model included variables such as the Federal Reserve Interest Rates and crime statistics but excluded broader income measures, acknowledging their limited explanatory power in the face of local economic conditions.

To address the heteroskedasticity observed, a Weighted Least Squares (WLS) model, enhanced by stepwise regression, was deployed. This adjustment improved the fit by accounting for variance inconsistencies across data, corroborating the significance of all included variables. Furthermore, the refinement depicted how an appropriate model technique could moderate variance-driven distortions, although it still faced the challenge of addressing serial correlation thoroughly.

4.2 Log Transformation and Bootstrapping

To overcome the risk of overfitting associated with high initial R² values in the models, log transformations for variables like median household income and crime rates were conducted. The transformed model revealed an improved Adjusted R² score of 99.66%, suggesting better adaptability to changes in underlying variable distributions. However, to validate model fidelity and mitigate overfitting, a Bootstrapping approach was leveraged, ensuring the robustness of coefficient estimates by calculating confidence intervals and standard errors across multiple resamples.

4.3 Cross-Validation

Finally, cross-validation captured the model’s capability to generalize over unseen data folds, further supporting the model’s stability and predictive capability.

4.4 Final Conclusion

Overall, the analysis underscores the complexity of modeling housing prices, particularly in a dynamic and multifaceted economic environment like New York City. The iterative refinement of models—from OLS to WLS and then incorporating bootstrapping—reinforced the importance of adapting methods suitable to data characteristics to yield precise and actionable insights. Despite challenges like serial correlation, which prompt future exploration into time-series models, the project offers a data-driven framework to guide stakeholders in housing sectors on leveraging economic indicators in predictive modeling accurately.